{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "residential-canada",
   "metadata": {},
   "source": [
    "## Define the current running path to save output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "equipped-blink",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.chdir(\"/home/xiongyx/mtANN\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "personalized-component",
   "metadata": {},
   "source": [
    "## 0. Import python modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "nervous-kuwait",
   "metadata": {},
   "outputs": [],
   "source": [
    "from mtANN import model\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "worse-kinase",
   "metadata": {},
   "source": [
    "## 1. Read in data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "actual-morgan",
   "metadata": {},
   "source": [
    "The input data considered by the current version of mtANN is in csv format, where rows are samples and columns are features. In addition, its cell type information is stored in another csv file, and its naming format is the name of the dataset +_label.\n",
    "    \n",
    "Here we use the Pancreas data as an example: Download the dataset file and unzip it. Then move everything in datasets/ to data/pancreas/."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "drawn-fighter",
   "metadata": {},
   "source": [
    "### 1.1 Get the names of all sequencing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "buried-determination",
   "metadata": {},
   "outputs": [],
   "source": [
    "files,_=os.walk(\"./datasets/panc/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "approximate-miracle",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['human' 'muraro' 'seg' 'xin']\n"
     ]
    }
   ],
   "source": [
    "file = []\n",
    "for f in files[2]:\n",
    "    file.append(f.split(\"_\")[0])\n",
    "tech = []\n",
    "for f in file:\n",
    "    tech.append(f.split(\".\")[0])\n",
    "tech = np.unique(tech)\n",
    "print(tech)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "israeli-denver",
   "metadata": {},
   "source": [
    "### 1.2 read in target data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "weighted-accused",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "target data is human\n"
     ]
    }
   ],
   "source": [
    "dt = 0\n",
    "target = tech[dt]\n",
    "print(\"target data is {}\".format(target))\n",
    "target_dataset = {}\n",
    "target_dataset[\"expression\"] = pd.read_csv(\"./datasets/panc/{}.csv\".format(target), header=0, index_col=0)\n",
    "target_dataset[\"expression\"] = target_dataset[\"expression\"].div(target_dataset[\"expression\"].apply(lambda x: x.sum(), axis=1), axis=0) * 10000\n",
    "target_dataset[\"cell_type\"] = pd.read_csv(\"./datasets/panc/{}_label.csv\".format(target), header=0, index_col=False).x.values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "contrary-measurement",
   "metadata": {},
   "source": [
    "### 1.3 read in remaining data as reference datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "established-coupon",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "source is muraro\n",
      "source is seg\n",
      "source is xin\n"
     ]
    }
   ],
   "source": [
    "epr_s = list()\n",
    "label_s = list()\n",
    "for ds in set(np.arange(len(tech))).difference(set([dt])):\n",
    "    source = tech[ds]\n",
    "    print(\"source is {}\".format(source))\n",
    "    source_dataset = {}\n",
    "    source_dataset[\"expression\"] = pd.read_csv(\"./datasets/panc/{}.csv\".format(source), header=0, index_col=0)\n",
    "    source_dataset[\"cell_type\"] = pd.read_csv(\"./datasets/panc/{}_label.csv\".format(source), header=0, index_col=False).x.values\n",
    "    source_dataset[\"expression\"] = source_dataset[\"expression\"].div(source_dataset[\"expression\"].apply(lambda x: x.sum(), axis=1), axis=0) * 10000\n",
    "    epr_s.append(source_dataset[\"expression\"])\n",
    "    label_s.append(source_dataset[\"cell_type\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rising-rider",
   "metadata": {},
   "source": [
    "## 2. Fit mtANN model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "expanded-coating",
   "metadata": {},
   "source": [
    "There are six parameters to input:\n",
    "\n",
    "**expression_s**: A list of gene expression matrices for reference datasets, each matrix is formatted as a dataframe object where rowa are cells and columns are genes. <br>\n",
    "**label_s**: A list of cell-type labels corresponding to references in expression_s. Its length is equal to the length of expression_s.<br>\n",
    "**expression_t**: The gene expression matrix of target dataset whose format is the same as reference datasets.<br>\n",
    "**threshold**: Either be default or a number between 0~1. This parameter indicates that the threshold for unseen cell-type identification is selected using the method's default threshold or user-defined.<br>\n",
    "**gene_select**: Either be default or others. The \"default\" means that the default eight gene selection methods are used. Other values indicate that all the genes in expression_s are used.<br>\n",
    "**CUDA**: A logic parameter. It indicates whether to use gpu."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "progressive-array",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selecting genes with default methods\n",
      "Convert 0-th ref to R object\n",
      "Convert 1-th ref to R object\n",
      "Convert 2-th ref to R object\n",
      "the number of references is 3 \n",
      "Selecting genes with gc\n",
      "Gene number is 12570\n",
      "Cell number is 2119\n",
      "Selecting genes with gc\n",
      "Gene number is 14288\n",
      "Cell number is 2108\n",
      "Selecting genes with gc\n",
      "Gene number is 14912\n",
      "Cell number is 1492\n",
      "training 1-th classification model\n",
      "training 2-th classification model\n",
      "training 3-th classification model\n",
      "training 4-th classification model\n",
      "training 5-th classification model\n",
      "training 6-th classification model\n",
      "training 7-th classification model\n",
      "training 8-th classification model\n",
      "training 9-th classification model\n",
      "training 10-th classification model\n",
      "training 11-th classification model\n",
      "training 12-th classification model\n",
      "training 13-th classification model\n",
      "training 14-th classification model\n",
      "training 15-th classification model\n",
      "training 16-th classification model\n",
      "training 17-th classification model\n",
      "training 18-th classification model\n",
      "training 19-th classification model\n",
      "training 20-th classification model\n",
      "training 21-th classification model\n",
      "training 22-th classification model\n",
      "training 23-th classification model\n",
      "training 24-th classification model\n"
     ]
    }
   ],
   "source": [
    "mid_annotation, final_annotation, m, threshold = mtANN(expression_s = epr_s, label_s=label_s, expression_t=target_dataset[\"expression\"], threshold=\"default\", gene_select=\"default\", CUDA = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "joined-committee",
   "metadata": {},
   "source": [
    "We can obtain four output:\n",
    "\n",
    "**mid_annotation**: A numpy object which is the metaphase annotation results. <br>\n",
    "**final_annotation**: A numpy object which is the annotation results with \"unassigned\" cell selected by default threshold. <br>\n",
    "***m***: A numpy object which is the unseen cell-type identification metric. <br>\n",
    "**threshold**: A numeric object which is the selected default threshold."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "initial-phenomenon",
   "metadata": {},
   "source": [
    "## 3. Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "dominant-namibia",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The annotation accuracy with the default threshold is 0.8704741882737678\n"
     ]
    }
   ],
   "source": [
    "accuracy = sum(final_annotation[:,0] == target_dataset[\"cell_type\"])/final_annotation.shape[0]\n",
    "print(\"The annotation accuracy with the default threshold is {}\".format(accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "raised-roman",
   "metadata": {},
   "source": [
    "## 4. Save all the results to output dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "british-crash",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(mid_annotation).to_csv(\"./output/mid_annotation_{}.csv\".format(target), index=False)\n",
    "pd.DataFrame(final_annotation).to_csv(\"./output/final_annotation_{}.csv\".format(target), index=False)\n",
    "pd.DataFrame(m).to_csv(\"./output/metric_{}.csv\".format(target), index=False)\n",
    "pd.DataFrame(threshold).to_csv(\"./output/threshold_{}.csv\".format(target), index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
